from IPython.display import Image
Image('/home/nikolay/Documents/Medium/UMAP_DataIntegration/graph_integration.png', width=2000)
The idea of graph-based data integration is very simple. Individual data sets can originate from very different statistical distributions. They can be binary, categorical or continous. However, when having been converted into graphs, the individual data sets represent pairwise connections between data points without any "memory" of what statistical process generated the individual data sets. In the graph space, it is straighforward to find an intersection between individual graphs from individual data sets by keeping edges consistently present between the data points across the graphs from individual data sets. The advantage of this approach is that one can apply an appropriate distance metric when converting the raw data into graphs, i.e. working with binary data one might want to apply hamming distance to compute pairwise connections, while working with continuous data it might be sufficient to use the euclidean distance.
To demonstrate the graph-based data integration, we will apply it to the scNMT single cell Omics (https://www.nature.com/articles/s41467-018-03149-4) data set that represents 1) gene expression (scRNAseq), 2) DNA methylation (scBSseq), and 3) open chromatine region (scATACseq) from the same biological cells. We will start with reading scNMTseq data that is scRNAseq, scATACseq and scBSseq, filtering the data , e.g. removing lowly expressed genes, and checking the data distributions.
import pandas as pd
scRNAseq = pd.read_csv('/home/nikolay/Documents/Medium/UnsupervisedOMICsIntegration/scRNAseq.txt', sep = '\t')
print('Dimensions of scRNAseq data set: ' + str(scRNAseq.shape))
scBSseq = pd.read_csv('/home/nikolay/Documents/Medium/UnsupervisedOMICsIntegration/scBSseq.txt', sep = '\t')
print('Dimensions of scBSseq data set: ' + str(scBSseq.shape))
scATACseq = pd.read_csv('/home/nikolay/Documents/Medium/UnsupervisedOMICsIntegration/scATACseq.txt', sep = '\t')
print('Dimensions of scATACseq data set: ' + str(scATACseq.shape))
There are aonly 113 cells in the scNMT multi-Omics data set, i.e very few and it is not a high-throughput technology yet, but on the other hand this is good for demonstration purposes as we can quite easily track all pairwise connections between the cells. Now we will remove genes with the mean counts below 1 across the cells. In addition, we will apply a log-transformation to make the Poissson distributed gene expression counts to look more or less as if they are coming from Normal distribution.
import numpy as np
scRNAseq = scRNAseq.loc[:, scRNAseq.mean(axis = 0) > 1]
scRNAseq = np.log10(scRNAseq + 1)
print('Dimensions of scRNAseq data set: ' + str(scRNAseq.shape))
Let us plot the log-converted single cell gene expression data and see that it looks tolerable to apply Euclidean distance as a measure of similarity in gene expression between the cells.
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sns.set(rc = {'figure.figsize':(20, 15)}, font_scale = 2)
sns.distplot(scRNAseq.mean(axis = 0), bins = 100)
plt.title('scRNAseq', fontsize = 25)
plt.xlabel('log10(Gene Expression)', fontsize = 25)
plt.show()
The single cell gene expression (scRNAseq) data is clearly continuous, let us have a look at the data matrix:
scRNAseq
However, when we plot scBSseq and scATACseq data, it is quite clear that they follow a bimodal distribution, therefore it would not be appropriate to use Euclidean distance to measure similarity between cells using this type of rather binary data.
import seaborn as sns
import matplotlib.pyplot as plt
figure = plt.figure(figsize = (20, 8))
plt.subplot(121)
plt.hist(scBSseq.mean(axis = 0), bins = 100, color = 'darkgreen')
plt.title('scBSseq', fontsize = 20)
plt.xlabel('Mean Methylation Status', fontsize = 20)
plt.subplot(122)
plt.hist(scATACseq.mean(axis = 0), bins = 100, color = 'darkred')
plt.title('scATACseq', fontsize = 20)
plt.xlabel('Mean Open Chromatin Region Status', fontsize = 20)
figure.tight_layout()
plt.show()
If we look at the scBSseq data matrix, the elements vary between 0% and 100% implying a whether a certain CpG site is methylated (100%) or unmethylated (0%). In reality, not all elements of the matrix are strictly either 0% or 100%, this might be due to technical sampling and sequencing reasons. However, with a good certainty one can treat the scBSseq data (the same is valid for scATACseq data as well) as binary data. To make the data explicitly binary, we will recode all counts below 50% as 0 and above 50% as 1.
scBSseq
import numpy as np
scBSseq = pd.DataFrame(np.where(scBSseq < 50, 0, 1), index = scBSseq.index, columns = scBSseq.columns)
scBSseq
We will apply a similar binarization procedure to teh scATACseq data. You do not have to do it this way, but because of methodological reasons, it is worthwile demonstrating how continuous data (scRNAseq) can be integrated with binary data (scBSseq, scATACseq) using graph intersect approach.
scATACseq
import numpy as np
scATACseq = pd.DataFrame(np.where(scATACseq < 50, 0, 1), index = scATACseq.index, columns = scATACseq.columns)
scATACseq
Now, we will construct a KNN graph using the raw gene expression data from the scRNaseq data set. We will aply Euclidean similarity metric betwen the cells and keep 30 nearest neighbors for each cell. In fact, when starting with a fully-connected graph, the KNN method in scikitlearn will check the weights for the edges of each pair of data points and select 30 neighbors with largest weights for each data point. As a result, we will get a KNN adjacency matrix, where 1 means cells are connected and 0 means disconnected.
import numpy as np
import igraph as ig
from sklearn.neighbors import kneighbors_graph
n_neighbor = 30
cell_ids = scRNAseq.index
knn_scRNAseq = kneighbors_graph(scRNAseq.values, n_neighbor, metric = 'euclidean', mode = 'connectivity').toarray()
knn_scRNAseq = pd.DataFrame(knn_scRNAseq, columns = cell_ids, index = cell_ids)
knn_scRNAseq
Now we will make a data frame showing for each pair of points whether they are connected or disconnected. We will select only connected pairs of cells.
knn_scRNAseq = knn_scRNAseq.stack().reset_index()
knn_scRNAseq = knn_scRNAseq.rename(columns = {'level_0': 'cell1', 'level_1': 'cell2', 0: 'connectivity'})
knn_scRNAseq = knn_scRNAseq.loc[knn_scRNAseq['connectivity'] != 0]
knn_scRNAseq
Let us select a random pair of points, e.g. ESC_F04 and ESC_F07, that seem to be connected in the scRNAseq data set, and track this connection in the scBSseq and scATACseq data sets. Below we will visualize the KNN graph using the MDS layout, and highlight the connection between ESC_F04 and ESC_F07 cells by red edge.
knn_scRNAseq[(knn_scRNAseq['cell1'] == 'ESC_F04') & (knn_scRNAseq['cell2'] == 'ESC_F07')]
import cairocffi
from igraph import Graph, Plot
from IPython.display import Image
from igraph.drawing.text import TextDrawer
knn_scRNAseq = ig.Graph.TupleList([tuple(x) for x in knn_scRNAseq.values], directed = False)
knn_scRNAseq.vs["label"] = knn_scRNAseq.vs['name']
visual_style = {}
visual_style["bbox"] = (800, 600)
visual_style["margin"] = 50
p = Plot("scRNAseq_graph.png", bbox = (800, 600), background = "white")
layout = knn_scRNAseq.layout_mds()
knn_scRNAseq.vs['color'] = ['red' if x['name'] in ['ESC_F04', 'ESC_F07'] else 'gold' for x
in knn_scRNAseq.vs]
knn_scRNAseq.es['color'] = 'rgba(192, 192, 192, 0.3)'
vertex_source = set(['ESC_F04'])
vertex_target = set(['ESC_F07'])
red_edges = knn_scRNAseq.es.select(_source_in = vertex_source, _target_in = vertex_target)
red_edges['color'] = 'red'
red_edges['width'] = 5
p.add(knn_scRNAseq, layout = layout, vertex_size = 8, vertex_label_size = 10, **visual_style)
p.redraw()
ctx = cairocffi.Context(p.surface)
ctx.set_font_size(20)
drawer = TextDrawer(ctx, 'scRNAseq KNN Graph', halign = TextDrawer.CENTER)
drawer.draw_at(300, 30, width = 200)
#plot.show()
p.save()
display(Image('scRNAseq_graph.png', width = 2000))
We already now can observe some sort of clustering of ESC cells and EB cells separately from each other. Now we will do the same graph construction for the scBSseq data. In this case, since scBSseq is a binary data, it is not optimal to measure similarity between cells with the Euclidean distance, the way we did it for scRNAseq data. we will choose the Hamming distance for now, Dice and Jaccard distances might also be a good choice for categorical / binary data.
knn_scBSseq = kneighbors_graph(scBSseq.values, n_neighbor, metric = 'hamming', mode = 'connectivity').toarray()
knn_scBSseq = pd.DataFrame(knn_scBSseq, columns = cell_ids, index = cell_ids)
knn_scBSseq = knn_scBSseq.stack().reset_index()
knn_scBSseq = knn_scBSseq.rename(columns = {'level_0': 'cell1', 'level_1': 'cell2', 0: 'connectivity'})
knn_scBSseq = knn_scBSseq.loc[knn_scBSseq['connectivity'] != 0]
knn_scBSseq[(knn_scBSseq['cell2'] == 'ESC_F04') & (knn_scBSseq['cell1'] == 'ESC_F07')]
Again the ESC_F04 and ESC_F07 cells seem to be connected in the scBSseq data as well. We will proceed with visualizing the scBSseq KNN graph and highlighting the connection between ESC_F04 and ESC_F07 cells.
knn_scBSseq = ig.Graph.TupleList([tuple(x) for x in knn_scBSseq.values], directed = False)
knn_scBSseq.vs["label"] = knn_scBSseq.vs['name']
visual_style = {}
visual_style["bbox"] = (800, 600)
visual_style["margin"] = 70
p = Plot("scBSseq_graph.png", bbox = (800, 600), background = "white")
layout = knn_scBSseq.layout_mds()
knn_scBSseq.vs['color'] = ['red' if x['name'] in ['ESC_F04', 'ESC_F07'] else 'green' for x
in knn_scBSseq.vs]
knn_scBSseq.es['color'] = 'rgba(192, 192, 192, 0.3)'
vertex_source = set(['ESC_F04'])
vertex_target = set(['ESC_F07'])
red_edges = knn_scBSseq.es.select(_source_in = vertex_source, _target_in = vertex_target)
red_edges['color'] = 'red'
red_edges['width'] = 5
p.add(knn_scBSseq, layout = layout, vertex_size = 8, vertex_label_size = 10, **visual_style)
p.redraw()
ctx = cairocffi.Context(p.surface)
ctx.set_font_size(20)
drawer = TextDrawer(ctx, 'scBSseq KNN Graph', halign=TextDrawer.CENTER)
drawer.draw_at(300, 40, width = 200)
#plot.show()
p.save()
display(Image('scBSseq_graph.png', width = 2000))
And finally we will repeat the same operation (constructing and visualizing KNN graph with Hamming distance) for the scATAC data. Luckily, the ESC_F04 and ESC_F07 cells seem to be connected in this scOmic layer as well. This implies that this graph edge is consistently present across the individual scOmics graphs and therefore should be kept in the consensus graph after we have intersected the individual graphs.
knn_scATACseq = kneighbors_graph(scATACseq.values, n_neighbor, metric = 'hamming',
mode = 'connectivity').toarray()
knn_scATACseq = pd.DataFrame(knn_scATACseq, columns = cell_ids, index = cell_ids)
knn_scATACseq = knn_scATACseq.stack().reset_index()
knn_scATACseq = knn_scATACseq.rename(columns = {'level_0': 'cell1', 'level_1': 'cell2', 0: 'connectivity'})
knn_scATACseq = knn_scATACseq.loc[knn_scATACseq['connectivity'] != 0]
knn_scATACseq[(knn_scATACseq['cell1'] == 'ESC_F04') & (knn_scATACseq['cell2'] == 'ESC_F07')]
knn_scATACseq = ig.Graph.TupleList([tuple(x) for x in knn_scATACseq.values], directed = False)
knn_scATACseq.vs["label"] = knn_scATACseq.vs['name']
visual_style = {}
visual_style["bbox"] = (800, 600)
visual_style["margin"] = 50
p = Plot("scATACseq_graph.png", bbox = (800, 600), background = "white")
layout = knn_scATACseq.layout_mds()
knn_scATACseq.vs['color'] = ['red' if x['name'] in ['ESC_F04', 'ESC_F07'] else 'turquoise' for x
in knn_scATACseq.vs]
knn_scATACseq.es['color'] = 'rgba(192, 192, 192, 0.1)'
vertex_source = set(['ESC_F04'])
vertex_target = set(['ESC_F07'])
red_edges = knn_scATACseq.es.select(_source_in = vertex_source, _target_in = vertex_target)
red_edges['color'] = 'red'
red_edges['width'] = 5
p.add(knn_scATACseq, layout = layout, vertex_size = 8, vertex_label_size = 10, **visual_style)
p.redraw()
ctx = cairocffi.Context(p.surface)
ctx.set_font_size(20)
drawer = TextDrawer(ctx, 'scATACseq KNN Graph', halign=TextDrawer.CENTER)
drawer.draw_at(300, 20, width = 200)
#plot.show()
p.save()
display(Image('scATACseq_graph.png', width = 2000))
Great! Now we have constructed 3 KNN graphs from scRNAseq, scBSseq, and scATACseq data sets. Now everything is ready for running intersection of the 3 graphs. The idea of graph intersection, as it beautifully demonstrated in Wolfram Mathworld (https://mathworld.wolfram.com/), is to keep edge consitently present across multiple scOmics data sets. We will is the "intersection" function implemented in the igraph library. some nodes are going to be isolated after we have performed the graph intersection. For simplicity, here, we are going to ignore / delete those nodes and not consider them as separate clusters.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/UMAP_DataIntegration/GraphIntersection.png', width=2000)
from igraph import intersection
knn_consensus = intersection([knn_scRNAseq, knn_scBSseq, knn_scATACseq])
to_delete_ids = [v.index for v in knn_consensus.vs if v.degree() == 0]
knn_consensus.delete_vertices(to_delete_ids)
to_delete_ids
knn_consensus
Let us visualize the consensus graph (the one obtained by intersecting the other 3 graphs) and demonstrate that the connection / edge between ESC_F04 and ESC_F07 cells is still present, we will highlight this edge by red.
visual_style = {}
visual_style["bbox"] = (800, 600)
visual_style["margin"] = 50
p = Plot("consensus_graph.png", bbox = (800, 600), background = "white")
layout = knn_consensus.layout_mds()
knn_consensus.vs['color'] = ['red' if x['name'] in ['ESC_F04', 'ESC_F07'] else 'magenta' for x
in knn_consensus.vs]
knn_consensus.es['color'] = 'rgba(192, 192, 192, 0.3)'
vertex_source = set(['ESC_F04'])
vertex_target = set(['ESC_F07'])
red_edges = knn_consensus.es.select(_source_in = vertex_source, _target_in = vertex_target)
red_edges['color'] = 'red'
red_edges['width'] = 5
p.add(knn_consensus, layout = layout, vertex_size = 8, vertex_label_size = 10, **visual_style)
p.redraw()
ctx = cairocffi.Context(p.surface)
ctx.set_font_size(20)
drawer = TextDrawer(ctx, 'Consensus Graph', halign=TextDrawer.CENTER)
drawer.draw_at(300, 30, width = 200)
#plot.show()
p.save()
display(Image('consensus_graph.png', width = 2000))
Now we can run Leiden (https://www.nature.com/articles/s41598-019-41695-z) graph-based clustering algorithm (modification of Louvain algorithm), that finds 4 clusters present on the consensus graph.
import leidenalg
knn_consensus_cluster = leidenalg.find_partition(knn_consensus, leidenalg.ModularityVertexPartition)
knn_consensus_cluster.modularity
knn_consensus_cluster.membership[0:10]
len(knn_consensus_cluster)
we will color the nodes on the consensus graph by the membership determined by the Leiden graph-based clustering algorithm. The consistent across the scOmics connection between the ESC_F04 and ESC_F07 cells is stil highlighted by red.
visual_style = {}
visual_style["bbox"] = (800, 600)
visual_style["margin"] = 50
p = Plot("consensus_graph.png", bbox = (800, 600), background = "white")
layout = knn_consensus.layout_mds()
pal = ig.drawing.colors.ClusterColoringPalette(len(knn_consensus_cluster))
knn_consensus.vs['color'] = pal.get_many(knn_consensus_cluster.membership)
p.add(knn_consensus, layout = layout, vertex_size = 8, vertex_label_size = 10, **visual_style)
p.redraw()
ctx = cairocffi.Context(p.surface)
ctx.set_font_size(20)
drawer = TextDrawer(ctx, 'Consensus Graph', halign = TextDrawer.CENTER)
drawer.draw_at(300, 30, width = 200)
#plot.show()
p.save()
display(Image('consensus_graph.png', width = 2000))
We can see that the EB cells seem to form a separate (green) cluster, while the ESC cells are now split into at least two (blue and red) clusters. There is a minor group of yellow cells that was found to be the fourth cluster but this probably should be tuned with the resolution parameter of the Leiden algorithm as the yellow cells do not seem to visually stand out as a separate cluster. On the other hand this might be an issues with the layout. Trying anohter graph layout could possible demonstrate that the yellow cells indeed form a distinct cluster.
It turns out that the manual KNN graph construction and intersection is not really necessary as this option is essentially included into UMAP algorithm. UMAP allows for fast individual scOmics graph building and intersection wioth just a few lines of code, see the interesting discussion about mixing different data types here https://github.com/lmcinnes/umap/issues/58.
Here we will overlap graphs from single cell RNAseq and Proteomics data sets from the CITE-seq technology (https://en.wikipedia.org/wiki/CITE-Seq), and produce a consensus / integrative OMICs UMAP plot. The idea is to convert the two data types into graphs, i.e. non-parametric space where they forget their technological differences, and overlap the graphs. The resulting graph will be used for constructing the UMAP embeddings.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/UMAP_DataIntegration/GraphDataIntegration.png', width = 2000)
By graph intersection we mean keeping edges consitently present across the different data sets:
We will start with reading and log-transforming the OMICs data.
import os
import numpy as np
import pandas as pd
from keras.models import Model
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from keras.utils import plot_model
from keras.layers import Input, Dense
from keras.layers.merge import concatenate
os.chdir('/home/nikolay/WABI/Misc/SingleCell/CITEseq/')
scRNAseq = pd.read_csv('scRNAseq.txt',sep='\t')
scProteomics = pd.read_csv('scProteomics.txt',sep='\t')
X_scRNAseq = scRNAseq.values[:,0:(scRNAseq.shape[1]-1)]
Y_scRNAseq = scRNAseq.values[:,scRNAseq.shape[1]-1]
X_scProteomics = scProteomics.values[:,0:(scProteomics.shape[1]-1)]
Y_scProteomics = scProteomics.values[:,scProteomics.shape[1]-1]
X_scRNAseq = np.log(X_scRNAseq + 1)
X_scProteomics = np.log(X_scProteomics + 1)
X_scRNAseq.shape
X_scProteomics.shape
Let us display the UMAP embeddings of individual OMICs data sets. We will start with the scRNAseq dataset. As an optimal number of nearest neighbors we will choose sqrt(N) = sqrt(8617) = 93.
import warnings
warnings.filterwarnings("ignore")
from umap import UMAP
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 20).fit_transform(X_scRNAseq)
model = UMAP(n_components = 2, min_dist = 1, n_neighbors = 93, init = X_reduced[:, 0:2], n_epochs = 1000,
verbose = 2)
umap = model.fit_transform(X_reduced)
plt.figure(figsize = (20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 20)
plt.title('UMAP: scRNAseq', fontsize = 25);
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
Now let us display UMAP embeddings for the scProteomics dataset of the CITE-seq sequencing technology.
import warnings
warnings.filterwarnings("ignore")
from umap import UMAP
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 10).fit_transform(X_scProteomics)
model = UMAP(n_components = 2, min_dist = 0.8, n_neighbors = 93, init = X_reduced[:, 0:2], n_epochs = 1000,
verbose = 2)
umap = model.fit_transform(X_reduced)
plt.figure(figsize = (20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = Y_scProteomics, cmap = 'tab20', s = 20)
plt.title('UMAP: scProteomics', fontsize = 25);
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()
Now we will construct the intersection of the two graphs / simplicial sets, and visualize the resulting embeddings.
import umap
X_reduced_scRNAseq = PCA(n_components = 20).fit_transform(X_scRNAseq)
X_reduced_scProteomics = PCA(n_components = 10).fit_transform(X_scProteomics)
fit1 = umap.UMAP(n_components = 2, min_dist = 1, n_neighbors = 93, n_epochs = 1000,
init = X_reduced_scRNAseq[:, 0:2], verbose = 2).fit(X_reduced_scRNAseq)
fit2 = umap.UMAP(n_components = 2, min_dist = 0.8, n_neighbors = 93, n_epochs = 1000,
init = X_reduced_scProteomics[:, 0:2], verbose = 2).fit(X_reduced_scProteomics)
intersection = umap.umap_. general_simplicial_set_intersection(fit1.graph_, fit2.graph_, weight = 0.45)
intersection = umap.umap_.reset_local_connectivity(intersection)
embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, intersection, fit1.n_components,
fit1.learning_rate, fit1._a, fit1._b,
fit1.repulsion_strength, fit1.negative_sample_rate,
1000, 'random', np.random, fit1.metric,
fit1._metric_kwds, False)
plt.figure(figsize = (20,15))
plt.scatter(embedding[:, 0], embedding[:, 1], c = Y_scRNAseq, cmap = 'tab20', s = 20)
plt.title('UMAP: scRNAseq + scProteomics', fontsize = 25);
plt.xlabel("UMAP1", fontsize = 22); plt.ylabel("UMAP2", fontsize = 22)
plt.show()